The objective of this analytical report is to help companies identify good employees who are at risk of leaving. With this information, companies can allocate their finances and resources to the areas that help retain those employees.
First, we will analyze and visualize the data to get a basic understanding of the data at hand (Human Resources Analytics by Ludovic Benistant, from kaggle.com). After obtaining this basic understanding, we will examine the correlations between the factors to identify and interpret the key drivers of attrition.
Second, we will segment all employees using cluster analysis to observe which clusters of employees have a higher probability of leaving.
Finally, we will bucket the employees (excluding the ones who have stayed) across two dimensions, performance and risk of leaving, in order to predict and identify the employees companies generally wish to retain even at a higher cost: high performing employees with a high risk of leaving (and perhaps also the low performing employees with a low probability of leaving). This will help companies target their investment in human resources and reduce the risk and negative impact of losing high performers.
First, let’s load the data to use.
ProjectData <- read.csv("./data/HR_data.csv")  # load the HR dataset
ProjectData <- data.matrix(ProjectData)        # convert all columns to a numeric matrix
Description of the data
This is what the first 10 observations (employees) look like.
| | Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| satisfaction_level | 0.38 | 0.80 | 0.11 | 0.72 | 0.37 | 0.41 | 0.10 | 0.92 | 0.89 | 0.42 |
| last_evaluation | 0.53 | 0.86 | 0.88 | 0.87 | 0.52 | 0.50 | 0.77 | 0.85 | 1.00 | 0.53 |
| number_project | 2.00 | 5.00 | 7.00 | 5.00 | 2.00 | 2.00 | 6.00 | 5.00 | 5.00 | 2.00 |
| average_montly_hours | 157.00 | 262.00 | 272.00 | 223.00 | 159.00 | 153.00 | 247.00 | 259.00 | 224.00 | 142.00 |
| time_spend_company | 3.00 | 6.00 | 4.00 | 5.00 | 3.00 | 3.00 | 4.00 | 5.00 | 5.00 | 3.00 |
| Work_accident | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| left | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| promotion_last_5years | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| salary_level | 1.00 | 2.00 | 2.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| sales | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| accounting | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| hr | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| technical | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| support | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| management | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| IT | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| product_mng | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| marketing | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| RandD | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
The data we use here have the following descriptive statistics.
| | min | 25 percent | median | mean | 75 percent | max | std |
|---|---|---|---|---|---|---|---|
| satisfaction_level | 0.09 | 0.44 | 0.64 | 0.61 | 0.82 | 1 | 0.25 |
| last_evaluation | 0.36 | 0.56 | 0.72 | 0.72 | 0.87 | 1 | 0.17 |
| number_project | 2.00 | 3.00 | 4.00 | 3.80 | 5.00 | 7 | 1.23 |
| average_montly_hours | 96.00 | 156.00 | 200.00 | 201.05 | 245.00 | 310 | 49.94 |
| time_spend_company | 2.00 | 3.00 | 3.00 | 3.50 | 4.00 | 10 | 1.46 |
| Work_accident | 0.00 | 0.00 | 0.00 | 0.14 | 0.00 | 1 | 0.35 |
| left | 0.00 | 0.00 | 0.00 | 0.24 | 0.00 | 1 | 0.43 |
| promotion_last_5years | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 1 | 0.14 |
| salary_level | 1.00 | 1.00 | 2.00 | 1.59 | 2.00 | 3 | 0.64 |
| sales | 0.00 | 0.00 | 0.00 | 0.28 | 1.00 | 1 | 0.45 |
| accounting | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
| hr | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
| technical | 0.00 | 0.00 | 0.00 | 0.18 | 0.00 | 1 | 0.39 |
| support | 0.00 | 0.00 | 0.00 | 0.15 | 0.00 | 1 | 0.36 |
| management | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 1 | 0.20 |
| IT | 0.00 | 0.00 | 0.00 | 0.08 | 0.00 | 1 | 0.27 |
| product_mng | 0.00 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.24 |
| marketing | 0.00 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.23 |
| RandD | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
Here we rescale the data in order to avoid the results being driven by a few relatively large values. Strictly speaking this is min-max normalization rather than standardization: each variable is scaled to lie between 0 and 1.
ProjectDataFactor_scaled = apply(ProjectDataFactor, 2, function(r) {
    # min-max scale each column to [0, 1]
    (r - min(r)) / (max(r) - min(r))
})
Below are the summary statistics of the scaled dataset.
| | min | 25 percent | median | mean | 75 percent | max | std |
|---|---|---|---|---|---|---|---|
| satisfaction_level | 0 | 0.38 | 0.60 | 0.57 | 0.80 | 1 | 0.27 |
| last_evaluation | 0 | 0.31 | 0.56 | 0.56 | 0.80 | 1 | 0.27 |
| number_project | 0 | 0.20 | 0.40 | 0.36 | 0.60 | 1 | 0.25 |
| average_montly_hours | 0 | 0.28 | 0.49 | 0.49 | 0.70 | 1 | 0.23 |
| time_spend_company | 0 | 0.12 | 0.12 | 0.19 | 0.25 | 1 | 0.18 |
| Work_accident | 0 | 0.00 | 0.00 | 0.14 | 0.00 | 1 | 0.35 |
| left | 0 | 0.00 | 0.00 | 0.24 | 0.00 | 1 | 0.43 |
| promotion_last_5years | 0 | 0.00 | 0.00 | 0.02 | 0.00 | 1 | 0.14 |
| salary_level | 0 | 0.00 | 0.50 | 0.30 | 0.50 | 1 | 0.32 |
| sales | 0 | 0.00 | 0.00 | 0.28 | 1.00 | 1 | 0.45 |
| accounting | 0 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
| hr | 0 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
| technical | 0 | 0.00 | 0.00 | 0.18 | 0.00 | 1 | 0.39 |
| support | 0 | 0.00 | 0.00 | 0.15 | 0.00 | 1 | 0.36 |
| management | 0 | 0.00 | 0.00 | 0.04 | 0.00 | 1 | 0.20 |
| IT | 0 | 0.00 | 0.00 | 0.08 | 0.00 | 1 | 0.27 |
| product_mng | 0 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.24 |
| marketing | 0 | 0.00 | 0.00 | 0.06 | 0.00 | 1 | 0.23 |
| RandD | 0 | 0.00 | 0.00 | 0.05 | 0.00 | 1 | 0.22 |
The simplest way to take a first look at a dataset is to check the correlations. This lets us easily see which factors have a high positive or negative correlation with employees leaving. Correlation is different from causality, so we cannot conclude that a highly correlated factor (independent variable) causes an employee to leave (dependent variable). Also, if some of the independent variables are highly correlated with each other, we could consider grouping those attributes together.
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | salary_level | sales | accounting | hr | technical | support | management | IT | product_mng | marketing | RandD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| satisfaction_level | 1.00 | 0.11 | -0.14 | -0.02 | -0.10 | 0.06 | -0.39 | 0.03 | 0.05 | 0.00 | -0.03 | -0.01 | -0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| last_evaluation | 0.11 | 1.00 | 0.35 | 0.34 | 0.13 | -0.01 | 0.01 | -0.01 | -0.01 | -0.02 | 0.00 | -0.01 | 0.01 | 0.02 | 0.01 | 0.00 | 0.00 | 0.00 | -0.01 |
| number_project | -0.14 | 0.35 | 1.00 | 0.42 | 0.20 | 0.00 | 0.02 | -0.01 | 0.00 | -0.01 | 0.00 | -0.03 | 0.03 | 0.00 | 0.01 | 0.00 | 0.00 | -0.02 | 0.01 |
| average_montly_hours | -0.02 | 0.34 | 0.42 | 1.00 | 0.13 | -0.01 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | -0.01 | 0.01 | 0.00 | 0.00 | 0.01 | -0.01 | -0.01 | 0.00 |
| time_spend_company | -0.10 | 0.13 | 0.20 | 0.13 | 1.00 | 0.00 | 0.14 | 0.07 | 0.05 | 0.02 | 0.00 | -0.02 | -0.03 | -0.03 | 0.12 | -0.01 | 0.00 | 0.01 | -0.02 |
| Work_accident | 0.06 | -0.01 | 0.00 | -0.01 | 0.00 | 1.00 | -0.15 | 0.04 | 0.01 | 0.00 | -0.01 | -0.02 | -0.01 | 0.01 | 0.01 | -0.01 | 0.00 | 0.01 | 0.02 |
| left | -0.39 | 0.01 | 0.02 | 0.07 | 0.14 | -0.15 | 1.00 | -0.06 | -0.16 | 0.01 | 0.02 | 0.03 | 0.02 | 0.01 | -0.05 | -0.01 | -0.01 | 0.00 | -0.05 |
| promotion_last_5years | 0.03 | -0.01 | -0.01 | 0.00 | 0.07 | 0.04 | -0.06 | 1.00 | 0.10 | 0.01 | 0.00 | 0.00 | -0.04 | -0.04 | 0.13 | -0.04 | -0.04 | 0.05 | 0.02 |
| salary_level | 0.05 | -0.01 | 0.00 | 0.00 | 0.05 | 0.01 | -0.16 | 0.10 | 1.00 | -0.04 | 0.01 | 0.00 | -0.02 | -0.03 | 0.16 | -0.01 | -0.01 | 0.01 | 0.00 |
| sales | 0.00 | -0.02 | -0.01 | 0.00 | 0.02 | 0.00 | 0.01 | 0.01 | -0.04 | 1.00 | -0.14 | -0.14 | -0.29 | -0.26 | -0.13 | -0.18 | -0.16 | -0.15 | -0.15 |
| accounting | -0.03 | 0.00 | 0.00 | 0.00 | 0.00 | -0.01 | 0.02 | 0.00 | 0.01 | -0.14 | 1.00 | -0.05 | -0.11 | -0.10 | -0.05 | -0.07 | -0.06 | -0.06 | -0.05 |
| hr | -0.01 | -0.01 | -0.03 | -0.01 | -0.02 | -0.02 | 0.03 | 0.00 | 0.00 | -0.14 | -0.05 | 1.00 | -0.11 | -0.10 | -0.05 | -0.07 | -0.06 | -0.06 | -0.05 |
| technical | -0.01 | 0.01 | 0.03 | 0.01 | -0.03 | -0.01 | 0.02 | -0.04 | -0.02 | -0.29 | -0.11 | -0.11 | 1.00 | -0.20 | -0.10 | -0.14 | -0.12 | -0.12 | -0.11 |
| support | 0.01 | 0.02 | 0.00 | 0.00 | -0.03 | 0.01 | 0.01 | -0.04 | -0.03 | -0.26 | -0.10 | -0.10 | -0.20 | 1.00 | -0.09 | -0.12 | -0.11 | -0.10 | -0.10 |
| management | 0.01 | 0.01 | 0.01 | 0.00 | 0.12 | 0.01 | -0.05 | 0.13 | 0.16 | -0.13 | -0.05 | -0.05 | -0.10 | -0.09 | 1.00 | -0.06 | -0.05 | -0.05 | -0.05 |
| IT | 0.01 | 0.00 | 0.00 | 0.01 | -0.01 | -0.01 | -0.01 | -0.04 | -0.01 | -0.18 | -0.07 | -0.07 | -0.14 | -0.12 | -0.06 | 1.00 | -0.08 | -0.07 | -0.07 |
| product_mng | 0.01 | 0.00 | 0.00 | -0.01 | 0.00 | 0.00 | -0.01 | -0.04 | -0.01 | -0.16 | -0.06 | -0.06 | -0.12 | -0.11 | -0.05 | -0.08 | 1.00 | -0.06 | -0.06 |
| marketing | 0.01 | 0.00 | -0.02 | -0.01 | 0.01 | 0.01 | 0.00 | 0.05 | 0.01 | -0.15 | -0.06 | -0.06 | -0.12 | -0.10 | -0.05 | -0.07 | -0.06 | 1.00 | -0.06 |
| RandD | 0.01 | -0.01 | 0.01 | 0.00 | -0.02 | 0.02 | -0.05 | 0.02 | 0.00 | -0.15 | -0.05 | -0.05 | -0.11 | -0.10 | -0.05 | -0.07 | -0.06 | -0.06 | 1.00 |
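A correlation table like the one above can be produced with base R's `cor` function; here is a minimal sketch on synthetic data (the actual analysis would use the scaled HR matrix):

```r
set.seed(1)
# Synthetic stand-in for the scaled employee data: 100 observations
X <- cbind(satisfaction = runif(100),
           evaluation   = runif(100),
           left         = rbinom(100, 1, 0.24))

corr <- round(cor(X), 2)  # pairwise Pearson correlations, rounded to 2 decimals
corr["satisfaction", "left"]  # correlation of one attribute with attrition
```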
The most significant variable is satisfaction level, which is strongly negatively correlated with leaving (-0.39); this is quite intuitive. Satisfaction level is also negatively correlated with time spent at the company and with the number of projects. This can be interpreted as "the longer an employee has stayed at the company, the lower the satisfaction level", which may indicate that the company is lacking in providing long term goals or visions. A higher number of projects is likewise associated with lower satisfaction. However, since long working hours show little correlation with attrition, we can infer that being involved in too many tasks, and the resulting disorganization and distraction, lowers satisfaction more than long working hours alone.
We use all the variables except "whether the employee has left" for the segmentation. We use Euclidean distance.
segmentation_attributes_used = c(1:6, 8:19)  # all attributes except "left" (column 7)
profile_attributes_used = c(1:19)            # profile the segments on every attribute
numb_clusters_used = 5
profile_with = "hclust"
distance_used = "euclidean"
hclust_method = "ward.D"
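With the parameters above, the segmentation pipeline amounts to computing Euclidean distances and running Ward hierarchical clustering; a self-contained sketch on synthetic data (standing in for the scaled HR matrix):

```r
set.seed(2)
X <- matrix(runif(200), ncol = 4)      # synthetic: 50 observations, 4 attributes

d  <- dist(X, method = "euclidean")    # pairwise Euclidean distances
hc <- hclust(d, method = "ward.D")     # Ward hierarchical clustering, as configured

memberships <- cutree(hc, k = 5)       # cut the tree into 5 segments
table(memberships)                     # observations per segment
```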
Here are the differences between the observations using the distance metric we selected:
| | Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Obs.01 | 0.00 | |||||||||
| Obs.02 | 1.21 | 0.00 | ||||||||
| Obs.03 | 1.39 | 0.89 | 0.00 | |||||||
| Obs.04 | 0.97 | 0.55 | 0.96 | 0.00 | ||||||
| Obs.05 | 0.02 | 1.22 | 1.39 | 0.98 | 0.00 | |||||
| Obs.06 | 0.06 | 1.23 | 1.43 | 0.99 | 0.06 | 0.00 | ||||
| Obs.07 | 1.03 | 0.98 | 0.58 | 0.75 | 1.03 | 1.07 | 0.00 | |||
| Obs.08 | 1.12 | 0.53 | 1.11 | 0.28 | 1.13 | 1.13 | 0.94 | 0.00 | ||
| Obs.09 | 1.17 | 0.60 | 1.12 | 0.28 | 1.18 | 1.19 | 0.97 | 0.29 | 0.00 | |
| Obs.10 | 0.08 | 1.23 | 1.43 | 0.98 | 0.10 | 0.07 | 1.08 | 1.13 | 1.17 | 0.00 |
We can see the histogram of, say, the first two variables,
or the histogram of all pairwise distances for the Euclidean distance:
Let’s use hierarchical clustering. It may be useful to see the dendrogram from hclust to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:
We can also plot the “distances” traveled before we need to merge any of the smaller clusters into larger ones: the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. With n observations this plot has n-1 numbers; we show the first 20 here.
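These merge heights are stored in the `height` component of the hclust result; a sketch on synthetic data:

```r
set.seed(3)
X  <- matrix(runif(200), ncol = 4)     # synthetic stand-in for the scaled data
hc <- hclust(dist(X), method = "ward.D")

# hclust stores the n - 1 merge heights; reversing shows the largest merges first
heights <- rev(hc$height)
head(heights, 20)                      # the 20 largest "distances traveled"
# plot(heights[1:20])                  # elbow-style view of candidate segment counts
```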
For now let’s consider the 4-segment solution. We can also see the segment each observation (each employee, in this case) belongs to, for the first 20 employees:
| Observation Number | Cluster_Membership |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
| 7 | 1 |
| 8 | 1 |
| 9 | 1 |
| 10 | 1 |
| 11 | 1 |
| 12 | 1 |
| 13 | 1 |
| 14 | 1 |
| 15 | 1 |
| 16 | 1 |
| 17 | 1 |
| 18 | 1 |
| 19 | 2 |
| 20 | 1 |
Having decided how many clusters to use, we would like to get a better understanding of who the employees in those clusters are, and to interpret the segments.
Let’s first see how many observations we have in each segment, for the segments selected above:
| | Segment 1 | Segment 2 | Segment 3 | Segment 4 |
|---|---|---|---|---|
| Number of Obs. | 4040 | 6058 | 2692 | 2209 |
The average values of our data for the total population, as well as within each segment, are:
| | Population | Segment 1 | Segment 2 | Segment 3 | Segment 4 |
|---|---|---|---|---|---|
| satisfaction_level | 0.57 | 0.57 | 0.58 | 0.57 | 0.58 |
| last_evaluation | 0.56 | 0.55 | 0.56 | 0.56 | 0.57 |
| number_project | 0.36 | 0.36 | 0.36 | 0.38 | 0.36 |
| average_montly_hours | 0.49 | 0.49 | 0.49 | 0.50 | 0.49 |
| time_spend_company | 0.19 | 0.19 | 0.19 | 0.18 | 0.17 |
| Work_accident | 0.14 | 0.14 | 0.15 | 0.14 | 0.15 |
| left | 0.24 | 0.25 | 0.22 | 0.26 | 0.25 |
| promotion_last_5years | 0.02 | 0.00 | 0.05 | 0.00 | 0.00 |
| salary_level | 0.30 | 0.27 | 0.33 | 0.28 | 0.27 |
| sales | 0.28 | 1.00 | 0.02 | 0.00 | 0.00 |
| accounting | 0.05 | 0.00 | 0.13 | 0.00 | 0.00 |
| hr | 0.05 | 0.00 | 0.12 | 0.00 | 0.00 |
| technical | 0.18 | 0.00 | 0.00 | 1.00 | 0.00 |
| support | 0.15 | 0.00 | 0.00 | 0.00 | 1.00 |
| management | 0.04 | 0.00 | 0.10 | 0.00 | 0.00 |
| IT | 0.08 | 0.00 | 0.20 | 0.00 | 0.00 |
| product_mng | 0.06 | 0.00 | 0.15 | 0.00 | 0.00 |
| marketing | 0.06 | 0.00 | 0.14 | 0.00 | 0.00 |
| RandD | 0.05 | 0.00 | 0.13 | 0.00 | 0.00 |
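Profile tables like the one above are simply per-segment column means; a sketch on synthetic data and memberships (the real analysis profiles the scaled HR matrix with the hclust memberships):

```r
set.seed(4)
X <- matrix(runif(300), ncol = 3,
            dimnames = list(NULL, c("satisfaction", "evaluation", "hours")))
memberships <- cutree(hclust(dist(X), method = "ward.D"), k = 4)

# Population means alongside the mean of each attribute within each segment
profiles <- cbind(Population = colMeans(X),
                  sapply(split(seq_len(nrow(X)), memberships),
                         function(idx) colMeans(X[idx, , drop = FALSE])))
round(profiles, 2)
```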
The segment profiles appear to be driven mostly by department. Let’s run the analysis again, excluding the department information.
We use all the variables except “whether the employee has left” and the department dummies. We use Euclidean distance.
segmentation_attributes_used = c(1:6, 8:9)  # also drop the department dummies (columns 10:19)
profile_attributes_used = c(1:19)
numb_clusters_used = 5
profile_with = "hclust"
distance_used = "euclidean"
hclust_method = "ward.D"
Here are the differences between the observations using the distance metric we selected:
| | Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Obs.01 | 0.00 | |||||||||
| Obs.02 | 1.21 | 0.00 | ||||||||
| Obs.03 | 1.39 | 0.89 | 0.00 | |||||||
| Obs.04 | 0.97 | 0.55 | 0.96 | 0.00 | ||||||
| Obs.05 | 0.02 | 1.22 | 1.39 | 0.98 | 0.00 | |||||
| Obs.06 | 0.06 | 1.23 | 1.43 | 0.99 | 0.06 | 0.00 | ||||
| Obs.07 | 1.03 | 0.98 | 0.58 | 0.75 | 1.03 | 1.07 | 0.00 | |||
| Obs.08 | 1.12 | 0.53 | 1.11 | 0.28 | 1.13 | 1.13 | 0.94 | 0.00 | ||
| Obs.09 | 1.17 | 0.60 | 1.12 | 0.28 | 1.18 | 1.19 | 0.97 | 0.29 | 0.00 | |
| Obs.10 | 0.08 | 1.23 | 1.43 | 0.98 | 0.10 | 0.07 | 1.08 | 1.13 | 1.17 | 0.00 |
We skip this subsection for the second run.
Let’s use hierarchical clustering. It may be useful to see the dendrogram from hclust to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:
We can also plot the “distances” traveled before we need to merge any of the smaller clusters into larger ones: the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. With n observations this plot has n-1 numbers; we show the first 20 here.
For now let’s consider the 5-segment solution. We can also see the segment each observation (each employee, in this case) belongs to, for the first 20 employees:
| Observation Number | Cluster_Membership |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 1 |
| 6 | 1 |
| 7 | 3 |
| 8 | 4 |
| 9 | 4 |
| 10 | 1 |
| 11 | 1 |
| 12 | 3 |
| 13 | 4 |
| 14 | 1 |
| 15 | 1 |
| 16 | 1 |
| 17 | 1 |
| 18 | 4 |
| 19 | 3 |
| 20 | 4 |
Having decided how many clusters to use, we would like to get a better understanding of who the employees in those clusters are, and to interpret the segments.
Let’s first see how many observations we have in each segment, for the segments selected above:
| | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 |
|---|---|---|---|---|---|
| Number of Obs. | 1816 | 4172 | 2728 | 4190 | 2093 |
The average values of our data for the total population, as well as within each segment, are:
| | Population | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 |
|---|---|---|---|---|---|---|
| satisfaction_level | 0.57 | 0.37 | 0.71 | 0.28 | 0.70 | 0.61 |
| last_evaluation | 0.56 | 0.26 | 0.60 | 0.61 | 0.61 | 0.55 |
| number_project | 0.36 | 0.04 | 0.37 | 0.57 | 0.36 | 0.36 |
| average_montly_hours | 0.49 | 0.24 | 0.51 | 0.60 | 0.51 | 0.48 |
| time_spend_company | 0.19 | 0.12 | 0.14 | 0.36 | 0.16 | 0.19 |
| Work_accident | 0.14 | 0.00 | 0.00 | 0.03 | 0.00 | 1.00 |
| left | 0.24 | 0.77 | 0.10 | 0.35 | 0.15 | 0.08 |
| promotion_last_5years | 0.02 | 0.00 | 0.00 | 0.12 | 0.00 | 0.00 |
| salary_level | 0.30 | 0.25 | 0.60 | 0.33 | 0.00 | 0.30 |
| sales | 0.28 | 0.29 | 0.25 | 0.29 | 0.29 | 0.27 |
| accounting | 0.05 | 0.06 | 0.05 | 0.06 | 0.05 | 0.05 |
| hr | 0.05 | 0.07 | 0.05 | 0.05 | 0.05 | 0.04 |
| technical | 0.18 | 0.17 | 0.18 | 0.18 | 0.19 | 0.18 |
| support | 0.15 | 0.16 | 0.16 | 0.12 | 0.15 | 0.16 |
| management | 0.04 | 0.03 | 0.05 | 0.07 | 0.02 | 0.04 |
| IT | 0.08 | 0.07 | 0.09 | 0.07 | 0.09 | 0.08 |
| product_mng | 0.06 | 0.06 | 0.06 | 0.05 | 0.07 | 0.06 |
| marketing | 0.06 | 0.06 | 0.06 | 0.06 | 0.05 | 0.06 |
| RandD | 0.05 | 0.04 | 0.06 | 0.05 | 0.05 | 0.06 |
Everyone in Segment 5 had a work accident, which does not make for a good segmentation. Let’s run the analysis again, excluding the work accident information.
We use all the variables except “whether the employee has left,” the department dummies, and work accident. We use Euclidean distance.
segmentation_attributes_used = c(1:5, 8:9)  # additionally drop Work_accident (column 6)
profile_attributes_used = c(1:19)
numb_clusters_used = 4
profile_with = "hclust"
distance_used = "euclidean"
hclust_method = "ward.D"
Here are the differences between the observations using the distance metric we selected:
| | Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Obs.01 | 0.00 | |||||||||
| Obs.02 | 1.21 | 0.00 | ||||||||
| Obs.03 | 1.39 | 0.89 | 0.00 | |||||||
| Obs.04 | 0.97 | 0.55 | 0.96 | 0.00 | ||||||
| Obs.05 | 0.02 | 1.22 | 1.39 | 0.98 | 0.00 | |||||
| Obs.06 | 0.06 | 1.23 | 1.43 | 0.99 | 0.06 | 0.00 | ||||
| Obs.07 | 1.03 | 0.98 | 0.58 | 0.75 | 1.03 | 1.07 | 0.00 | |||
| Obs.08 | 1.12 | 0.53 | 1.11 | 0.28 | 1.13 | 1.13 | 0.94 | 0.00 | ||
| Obs.09 | 1.17 | 0.60 | 1.12 | 0.28 | 1.18 | 1.19 | 0.97 | 0.29 | 0.00 | |
| Obs.10 | 0.08 | 1.23 | 1.43 | 0.98 | 0.10 | 0.07 | 1.08 | 1.13 | 1.17 | 0.00 |
We skip this subsection for the third run.
Let’s use hierarchical clustering. It may be useful to see the dendrogram from hclust to get a quick idea of how the data may be segmented and how many segments there may be. Here is the dendrogram for our data:
We can also plot the “distances” traveled before we need to merge any of the smaller clusters into larger ones: the heights of the tree branches that link the clusters as we traverse the tree from its leaves to its root. With n observations this plot has n-1 numbers; we show the first 20 here.
For now let’s consider the 4-segment solution. We can also see the segment each observation (each employee, in this case) belongs to, for the first 20 employees:
| Observation Number | Cluster_Membership |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 1 |
| 6 | 1 |
| 7 | 3 |
| 8 | 4 |
| 9 | 4 |
| 10 | 1 |
| 11 | 1 |
| 12 | 3 |
| 13 | 4 |
| 14 | 1 |
| 15 | 1 |
| 16 | 1 |
| 17 | 1 |
| 18 | 4 |
| 19 | 2 |
| 20 | 4 |
Having decided how many clusters to use, we would like to get a better understanding of who the employees in those clusters are, and to interpret the segments.
Let’s first see how many observations we have in each segment, for the segments selected above:
| | Segment 1 | Segment 2 | Segment 3 | Segment 4 |
|---|---|---|---|---|
| Number of Obs. | 1930 | 6017 | 2135 | 4917 |
The average values of our data for the total population, as well as within each segment, are:
| | Population | Segment 1 | Segment 2 | Segment 3 | Segment 4 |
|---|---|---|---|---|---|
| satisfaction_level | 0.57 | 0.37 | 0.70 | 0.12 | 0.70 |
| last_evaluation | 0.56 | 0.26 | 0.58 | 0.67 | 0.59 |
| number_project | 0.36 | 0.04 | 0.36 | 0.65 | 0.36 |
| average_montly_hours | 0.49 | 0.24 | 0.50 | 0.65 | 0.51 |
| time_spend_company | 0.19 | 0.12 | 0.20 | 0.29 | 0.15 |
| Work_accident | 0.14 | 0.08 | 0.17 | 0.12 | 0.16 |
| left | 0.24 | 0.76 | 0.08 | 0.47 | 0.13 |
| promotion_last_5years | 0.02 | 0.00 | 0.05 | 0.00 | 0.00 |
| salary_level | 0.30 | 0.26 | 0.57 | 0.26 | 0.00 |
| sales | 0.28 | 0.30 | 0.27 | 0.27 | 0.28 |
| accounting | 0.05 | 0.06 | 0.05 | 0.06 | 0.05 |
| hr | 0.05 | 0.07 | 0.04 | 0.05 | 0.05 |
| technical | 0.18 | 0.17 | 0.17 | 0.19 | 0.19 |
| support | 0.15 | 0.15 | 0.15 | 0.14 | 0.15 |
| management | 0.04 | 0.02 | 0.06 | 0.04 | 0.02 |
| IT | 0.08 | 0.07 | 0.08 | 0.08 | 0.08 |
| product_mng | 0.06 | 0.06 | 0.06 | 0.05 | 0.07 |
| marketing | 0.06 | 0.06 | 0.06 | 0.05 | 0.05 |
| RandD | 0.05 | 0.04 | 0.05 | 0.05 | 0.06 |
# Classification (CART) setup: predict whether an employee left
dependent_variable = 7               # "left" (column 7)
independent_variables = c(1:5, 8:9)  # satisfaction, evaluation, workload and salary attributes
Probability_Threshold = 0.5

estimation_data_percent = 80
validation_data_percent = 10
random_sampling = 0

# Tree (CART) complexity control cp (e.g. 0.001 to 0.02, depending on the data)
CART_cp = 0.01

# the minimum size of a segment for the analysis to be run on that segment
min_segment = 100
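A minimal sketch of how these parameters map onto a CART model, using the rpart package on synthetic data (the actual analysis would use the HR matrix and the 80/10/10 split configured above):

```r
library(rpart)  # recursive partitioning (CART)

set.seed(5)
n <- 500
# Synthetic stand-in: attrition loosely driven by low satisfaction
satisfaction <- runif(n)
left <- factor(rbinom(n, 1, ifelse(satisfaction < 0.4, 0.7, 0.1)))
df <- data.frame(left, satisfaction, evaluation = runif(n))

est <- df[1:(0.8 * n), ]  # 80% estimation sample, as configured above

fit <- rpart(left ~ ., data = est, method = "class",
             control = rpart.control(cp = 0.01))  # complexity control cp

# Classify held-out observations using the 0.5 probability threshold
probs <- predict(fit, df[(0.8 * n + 1):n, ], type = "prob")[, "1"]
pred  <- as.integer(probs > 0.5)
```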
First, companies can implement policies to control attrition by managing the variables that correlate highly with employees leaving. For example, companies could work to raise employee satisfaction, especially on the variables associated with low satisfaction. To encourage employees to stay longer, companies could share a clear long term vision to help employees envision their future with the company. Companies could also cap the number of projects an employee joins at any one time, allowing them to focus deeply on a few projects and therefore find more meaningful value and satisfaction in them.
Second, companies can use the prediction model to retain the high performing employees with a high risk of leaving.